It is not surprising that extended-response items, typically short essays, are now an 
integral part of most large-scale assessments. Extended-response items provide an 
opportunity for students to demonstrate a wide range of skills and knowledge, including 
higher order thinking skills such as synthesis and analysis. Yet assessing students' 
writing is one of the most expensive and time-consuming activities for assessment 
programs. Prompts must be designed, rubrics created, and raters trained; the extended 
responses must then be scored, typically by multiple raters. 
With different people evaluating different essays, interrater reliability becomes an 
additional concern in the writing assessment process. Even with rigorous training, 
differences in the background, training, and experience of the raters can lead to subtle 
but important differences in grading. 

Computers and artificial intelligence have been proposed as tools to facilitate the 
evaluation of student essays. In theory, computer scoring can be faster, reduce costs, 
increase accuracy and eliminate concerns about rater consistency and fatigue. Further, 
the computer can quickly re-score materials should the scoring rubric be redefined. This 
article describes the three most prominent approaches to essay scoring. 

SYSTEMS 

The most prominent writing assessment programs are: 

*Project Essay Grade (PEG), introduced by Ellis Page in 1966, 

*Intelligent Essay Assessor (IEA), first introduced for essay grading in 1997 by Thomas 
Landauer and Peter Foltz, and 

*E-rater, used by Educational Testing Service (ETS) and developed by Jill Burstein. 

Descriptions of these approaches can be found at the web sites listed at the end of this 
article and in Whittington and Hunt (1999) and Wresch (1993). 

Page uses a regression model with surface features of the text (document length, word 
length, and punctuation) as the independent variables and the essay score as the 
dependent variable. Landauer's approach is a factor-analytic model of word 
co-occurrences which emphasizes essay content. Burstein uses a regression model 
with content features as the independent variables. 

PEG - PEG grades essays predominantly on the basis of writing quality (Page, 1994). 
The underlying theory is that there are intrinsic qualities to a person's writing style called 
trins that need to be measured, analogous to true scores in measurement theory. PEG 
uses approximations of these variables, called proxes, to measure these underlying 
traits. Specific attributes of writing style, such as average word length, number of 
semicolons, and word rarity are examples of proxes that can be measured directly by 
PEG to generate a grade. For a given sample of essays, human raters grade a large 
number of essays (100 to 400), and values for up to 30 proxes are computed for each essay. The grades 
are then entered as the criterion variable in a regression equation with all of the proxes 
as predictors, and beta weights are computed for each predictor. For the remaining 
unscored essays, the values of the proxes are found, and those values are then 
weighted by the betas from the initial analysis to calculate a score for the essay. 
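
The calibration and scoring steps just described can be sketched in a few lines of Python. This is a minimal illustration, not PEG itself: the three proxes (word count, average word length, semicolon count) are stand-ins taken from the examples in the text, and the function names proxes, fit_weights, and score_essay are hypothetical. The weights come from ordinary least squares on the human-graded sample.

```python
import re

def proxes(essay):
    """Illustrative surface features only; PEG's actual proxes are not public."""
    words = re.findall(r"[A-Za-z']+", essay)
    n = len(words)
    avg_len = sum(len(w) for w in words) / n if n else 0.0
    return [float(n), avg_len, float(essay.count(";"))]

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial
    pivoting. Assumes the system is non-singular."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_weights(feature_rows, grades):
    """Least-squares weights (intercept first) via the normal equations."""
    rows = [[1.0] + list(f) for f in feature_rows]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * g for r, g in zip(rows, grades)) for i in range(k)]
    return solve(xtx, xty)

def score_essay(essay, weights):
    """Weight an unscored essay's proxes by the fitted coefficients."""
    return sum(w * x for w, x in zip(weights, [1.0] + proxes(essay)))
```

In use, fit_weights would be called once with the proxes and human grades of the calibration essays, and score_essay would then be applied to each remaining essay. With only a handful of calibration essays the normal equations can be singular, which is one reason a sample in the hundreds is needed.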

Page has over 30 years of research consistently showing exceptionally high 
correlations. In one study, Page (1994) analyzed samples of 495 and 599 senior essays 
from the 1988 and 1990 National Assessment of Educational Progress using responses 
to a question about a recreation opportunity: whether a city government should spend 
its recreation money fixing up some abandoned railroad tracks or converting an old 
warehouse to new uses. With 20 variables, PEG reached multiple Rs as high as .87, 
close to the apparent reliability of the targeted judge groups. 

IEA - First patented in 1989, IEA was designed for indexing documents for information 
retrieval. The underlying idea is to identify which of several calibration documents are 
most similar to the new document based on the most specific (i.e., least frequent) index 
terms. For essays, the average grade on the most similar calibration documents is 
assigned as the computer-generated score (Landauer, Foltz, & Laham, 1998). 

With IEA, each calibration document is arranged as a column in a matrix. A list of every 
relevant content term, defined as a word, sentence, or paragraph, that appears in any of 
the calibration documents is compiled, and these terms become the matrix rows. The 
value in a given cell of the matrix is an interaction between the presence of the term in 
the source and the weight assigned to that term. Terms not present in a source are 
assigned a cell value of 0 for that column. If a term is present, then the term may be 
weighted in a variety of ways, including a 1 to indicate that it is present, a tally of the 
number of times the term appears in the source, or some other weight criterion 
representative of the importance of the term to the document in which it appears or to 
the content domain overall. 
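
A small sketch may make the matrix layout concrete. It assumes single words as terms and shows only two of the weighting schemes mentioned above (presence and raw count); the function name term_document_matrix and the weighting argument are hypothetical, and nothing here reflects IEA's proprietary details.

```python
from collections import Counter

def tokenize(text):
    """Split on whitespace after lowercasing; a deliberately crude tokenizer."""
    return text.lower().split()

def term_document_matrix(documents, weighting="count"):
    """Return (terms, matrix) with one row per term and one column per document.

    weighting="count" stores how often the term appears in the document;
    weighting="binary" stores 1 if it appears at all. Absent terms get 0.
    """
    counts = [Counter(tokenize(d)) for d in documents]
    terms = sorted(set().union(*counts))
    matrix = []
    for term in terms:
        row = []
        for c in counts:
            n = c[term]
            row.append((1 if n else 0) if weighting == "binary" else n)
        matrix.append(row)
    return terms, matrix
```

For example, the two calibration documents "war war slavery" and "slavery economy" yield the rows economy, slavery, and war, with the count matrix [[0, 1], [1, 1], [2, 0]].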

Each essay to be graded is converted into a column vector, with the essay representing 
a new source with cell values based on the terms (rows) from the original matrix. A 
similarity score is then calculated for the essay column vector relative to each column of 
the rubric matrix. The essay's grade is determined by averaging the similarity scores 
from a predetermined number of sources with which it is most similar. The system also 
provides a great deal of diagnostic and evaluative feedback. As with PEG, high 
correlations between IEA scores and human-scored essays have been reported. 
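
The comparison step can be sketched as follows. The sketch makes several assumptions that the text does not confirm: cosine is used as the similarity measure, the essay receives the average grade of its k most similar calibration documents (following the earlier description of IEA), and no dimension reduction is applied to the matrix. The names cosine and score_by_neighbors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def score_by_neighbors(essay_vector, calibration_columns, calibration_grades, k=2):
    """Average the human grades of the k calibration columns most similar to the essay."""
    ranked = sorted(
        zip(calibration_columns, calibration_grades),
        key=lambda pair: cosine(essay_vector, pair[0]),
        reverse=True,
    )
    top = ranked[:k]
    return sum(grade for _, grade in top) / len(top)
```

Here each calibration column and the essay vector are lists of cell values over the same term rows, so an essay that shares no terms with a calibration document has a similarity of 0 to it.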

E-rater - The Educational Testing Service's Electronic Essay Rater (e-rater) is a 
sophisticated "Hybrid Feature Technology" that uses syntactic variety, discourse 
structure (like PEG) and content analysis (like IEA). To measure syntactic variety, 
e-rater counts the number of complement, subordinate, infinitive, and relative clauses 
and occurrences of modal verbs (would, could) to calculate ratios of these syntactic 
features per sentence and per essay. For structure analysis, e-rater uses 60 different 
features, similar to PEG's proxes. Two indices are created to evaluate the similarity of 
the target essay's content to the content of calibrated essays. As described by Burstein 
et al. (1998), in their EssayContent analysis module, the vocabulary of each score 
category is converted to a single vector whose elements represent the total frequency of 
each word in the training essays for that holistic score category. The system computes 
correlations between the vector for a given test essay and the vectors representing the 
trained categories. The score that is most similar to the test essay is assigned as the 
evaluation of its content. E-rater's ArgContent analysis module is based on the inverse 
document frequency, like IEA. The word frequency vectors for the score categories are 
converted to vectors of word weights. Scores on the different components are weighted 
using regression to predict human graders' scores. 
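
The EssayContent idea described above can be illustrated with a short sketch: pool the word frequencies of the training essays in each score category, correlate a test essay's word-frequency vector with each category vector, and assign the best-matching score. The tokenization, the use of Pearson correlation, and the function names are assumptions made for illustration; e-rater's actual implementation is proprietary.

```python
from collections import Counter
import math

def pearson(u, v):
    """Pearson correlation of two equal-length vectors; 0.0 if undefined."""
    n = len(u)
    if n == 0:
        return 0.0
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    denom = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return cov / denom if denom else 0.0

def category_vectors(training):
    """training: list of (essay_text, holistic_score) pairs.

    Returns the shared vocabulary and one total-frequency vector per score category.
    """
    totals = {}
    for text, score in training:
        totals.setdefault(score, Counter()).update(text.lower().split())
    vocab = sorted(set().union(*totals.values()))
    return vocab, {s: [c[w] for w in vocab] for s, c in totals.items()}

def content_score(essay, vocab, vectors):
    """Return the score category whose vector correlates best with the essay."""
    counts = Counter(essay.lower().split())
    test_vec = [counts[w] for w in vocab]
    return max(vectors, key=lambda s: pearson(test_vec, vectors[s]))
```

Words in the test essay that never occur in the training essays are simply ignored, since they have no row in the shared vocabulary.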

ANALYSIS 



Several studies have reported favorably on PEG, IEA, and e-rater. A review of the 
research on IEA found that its scores typically correlate as well with human raters as the 
raters do with each other (Chung & O'Neil, 1997). Research on PEG consistently 
reports correlations between PEG and human graders that are high relative to the 
correlations among human graders (e.g., Page, Poggio, & Keith, 1997). E-rater was 
deemed so impressive that it is now operational and used to score the Graduate Management 
Admission Test (GMAT). All of the systems return grades that correlate significantly and 
meaningfully with those of human raters. 

Compared to IEA and e-rater, PEG has the advantage of being conceptually simpler 
and less taxing on computer resources. PEG is also the better choice for evaluating 
writing style, as IEA returns grades that have literally nothing to do with writing style. 

IEA and e-rater, however, appear to be the superior choice for grading content, as PEG 
relies on writing quality to determine grades. 

All three of these systems are proprietary and details of the exact process are not 
generally available. We do not know, for example, which variables are in any model or what 
their weights are. The use of automated essay scoring is also somewhat controversial. A 
well-written essay about baking a cake could receive a high score if PEG were used to 
grade essays about causes of the American Civil War. Conceivably, IEA could be 
tricked into giving a high score to an essay that was a string of relevant words with no 
sentence structure whatsoever. E-rater appears to overcome some of these criticisms at 
the expense of being fairly complicated. These criticisms are more problematic for PEG 
than for IEA and e-rater. 

One should not expect perfect accuracy from any automated scoring approach. The 
correlation of human ratings on state assessment constructed-response items is 
typically only .70 - .75. Thus, correlating with human raters as well as human raters 
correlate with each other is not a very high, nor very meaningful, standard. Because the 
systems are all based on normative data, the current state of the art does not appear 
conducive to scoring essays that call for creativity or personal experiences. The 
greatest chance of success for essay scoring appears to be for long essays that have 
been calibrated on large numbers of examinees and which have a clear scoring rubric. 
Those interested in pursuing essay scoring may want to look at the Bayesian 
Essay Test Scoring sYstem (BETSY), being developed by the author based on the 
naive Bayes text classification literature. Free software is available for research use. 

While recognizing the limitations, perhaps it is time for states and other programs to 
consider automated scoring services. We don't advocate abolishing human raters. 
Rather we can envision the use of any of these technologies as a validation tool with 
each essay scored by one human and by the computer. When the scores differ, the 
essay would be flagged for a second read. This would be quicker and less expensive 
than current practice. 
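
The validation workflow proposed here amounts to a simple comparison rule, sketched below. The tolerance parameter is an assumption added for illustration: with the default of 0, any difference between the human and computer scores triggers a second read, as in the text, while a larger value would allow small disagreements to pass. The function names are hypothetical.

```python
def needs_second_read(human_score, machine_score, tolerance=0):
    """True when the two scores differ by more than the allowed tolerance."""
    return abs(human_score - machine_score) > tolerance

def flag_essays(scored, tolerance=0):
    """scored: iterable of (essay_id, human_score, machine_score) tuples.

    Returns the ids of essays that should go to a second human reader.
    """
    return [essay_id for essay_id, human, machine in scored
            if needs_second_read(human, machine, tolerance)]
```

Only the flagged essays would need a second human rater, which is where the savings over scoring every essay twice would come from.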

We would also like to see retired essay prompts used as instructional tools. The retired 
essays and grades can be used to calibrate a scoring system. The entire system could 
then be made available to teachers to help them work with students on writing and 
higher-order skills. The system could also be coupled with a wide range of diagnostic 
information, such as the information currently available with IEA. 

KEY WEB SITES 



PEG - http://134.68.49.185/pegdemo/ref.asp 

IEA - http://www.knowledge-technologies.com/ 

E-rater - http://www.ets.org/research/erater.html 

Betsy - http://ericae.net/betsy/ 